250 research outputs found
Locality and compositionality in representation learning for complex visual tasks
The use of deep neural architectures coupled with specific innovations such as adversarial methods, pre-training on large datasets and mutual information estimation has in recent years allowed rapid progress in many complex vision tasks such as zero-shot learning, scene generation, or multi-modal classification. Despite such progress, it is still not clear if current representation learning methods will be enough to attain human-level performance on arbitrary visual tasks, and if not, what direction future research should take.
In this thesis, we focus on two aspects of representations that seem necessary to achieve good downstream performance: locality and compositionality. Locality can be understood as a representation's ability to retain local information. This is relevant in many cases and specifically benefits computer vision, where natural images inherently carry local information, e.g. relevant patches of an image or the multiple objects present in a scene. On the other hand, a compositional representation can be understood as one that arises from a combination of simpler parts. Convolutional neural networks are inherently compositional, and many complex images can be seen as a composition of relevant sub-components: two examples are the individual objects and attributes in a scene, and the semantic attributes used in zero-shot learning. We believe both properties hold the key to designing better representation learning methods.
In this thesis, we present three articles dealing with locality and/or compositionality, and their application to representation learning for complex visual tasks.
In the first article, we introduce ways of measuring locality and compositionality for image representations, and demonstrate that local and compositional representations perform better at zero-shot learning. We also use these two notions as the basis for designing class-matching deep info-max, a novel representation learning algorithm that achieves state-of-the-art performance in our proposed "Zero-shot from scratch" setting, a harder variant in which external information, e.g. pre-training on other image datasets, is not allowed.
In the second article, we show that by encouraging a generator to retain local object-level information, using a scene-graph similarity module, we can improve scene generation performance. This model also showcases the importance of compositionality, as many components operate individually on each object present. To fully demonstrate the reach of our approach, we perform a detailed analysis and propose a new framework to evaluate scene generation models.
Finally, in the third article, we show that encouraging high mutual information between local and global multi-modal representations of 2D and 3D medical images can lead to improvements in image classification and segmentation. This general framework can be applied to a wide variety of settings, and demonstrates the benefits not only of locality but also of compositionality, as multi-modal representations are combined to obtain a more general one.
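To make the local-global mutual information objective concrete, the sketch below shows a minimal InfoNCE-style estimator that pulls the local features of one view toward the global embedding of its paired view (for instance, another modality of the same medical image), using the other samples in the batch as negatives. This is an illustrative PyTorch sketch under assumed tensor layouts, not the thesis code; the function name local_global_infonce is hypothetical.

    import torch
    import torch.nn.functional as F

    def local_global_infonce(local_feats, global_feats, temperature=0.1):
        """InfoNCE-style lower bound on the mutual information between the local
        features of one view and the global embedding of its paired view.

        local_feats:  (B, C, L) tensor, L local feature vectors per sample
                      (e.g. a flattened convolutional feature map)
        global_feats: (B, C) tensor, one global embedding per paired sample
        """
        local_feats = F.normalize(local_feats, dim=1)
        global_feats = F.normalize(global_feats, dim=1)

        # logits[b, l, d]: similarity between local vector l of sample b
        # and the global embedding of sample d in the batch.
        logits = torch.einsum("bcl,dc->bld", local_feats, global_feats) / temperature

        B, L, _ = logits.shape
        # For every local vector of sample b, the matching global embedding
        # (index b) is the positive; the other samples' globals are negatives.
        targets = torch.arange(B, device=logits.device).repeat_interleave(L)
        return F.cross_entropy(logits.reshape(B * L, B), targets)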
Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting
The performance of time series forecasting has recently been greatly improved
by the introduction of transformers. In this paper, we propose a general
multi-scale framework that can be applied to state-of-the-art
transformer-based time series forecasting models (FEDformer, Autoformer, etc.).
By iteratively refining a forecasted time series at multiple scales with shared
weights, introducing architectural adaptations, and adding a specially designed
normalization scheme, we achieve significant performance
improvements, from 5.5% to 38.5% across datasets and transformer architectures,
with minimal additional computational overhead. Via detailed ablation studies,
we demonstrate the effectiveness of each of our contributions across the
architecture and methodology. Furthermore, our experiments on various public
datasets demonstrate that the proposed improvements outperform their
corresponding baselines. Our code is publicly available at
https://github.com/BorealisAI/scaleformer
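As a rough illustration of the iterative multi-scale refinement described above, the sketch below downsamples the history, forecasts at the coarsest scale with a shared model, and upsamples each forecast to condition the next, finer scale. The model(history, prev_forecast) interface, the pooling factors, and the omission of the paper's normalization scheme are simplifying assumptions, not the released Scaleformer code.

    import torch
    import torch.nn.functional as F

    def multiscale_refine(history, horizon, model, scales=(16, 4, 1)):
        """Coarse-to-fine forecasting with a single shared model.

        history: (B, T, D) past observations
        horizon: number of future steps to predict (assumed divisible by scales[0])
        model:   placeholder module mapping (history, prev_forecast) -> forecast
        scales:  temporal pooling factors, coarsest first (1 = full resolution)
        """
        forecast = None
        for s in scales:
            # Downsample the history to the current temporal scale.
            if s == 1:
                hist_s = history
            else:
                hist_s = F.avg_pool1d(history.transpose(1, 2), kernel_size=s).transpose(1, 2)
            horizon_s = horizon // s

            if forecast is None:
                prev = torch.zeros(history.size(0), horizon_s, history.size(2),
                                   device=history.device)
            else:
                # Upsample the coarser forecast to condition the finer scale.
                prev = F.interpolate(forecast.transpose(1, 2), size=horizon_s,
                                     mode="linear", align_corners=False).transpose(1, 2)

            forecast = model(hist_s, prev)  # the same weights are reused at every scale
        return forecast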
What Constitutes Good Contrastive Learning in Time-Series Forecasting?
In recent years, the introduction of self-supervised contrastive learning
(SSCL) has demonstrated remarkable improvements in representation learning
across various domains, including natural language processing and computer
vision. By leveraging the inherent benefits of self-supervision, SSCL enables
the pre-training of representation models using vast amounts of unlabeled data.
Despite these advances, there remains a significant gap in understanding the
impact of different SSCL strategies on time series forecasting performance, as
well as the specific benefits that SSCL can bring. This paper aims to address
these gaps by conducting a comprehensive analysis of the effectiveness of
various training variables, including different SSCL algorithms, learning
strategies, model architectures, and their interplay. Additionally, to gain
deeper insights into the improvements brought about by SSCL in the context of
time-series forecasting, a qualitative analysis of the empirical receptive
field is performed. Through our experiments, we demonstrate that the end-to-end
training of a Transformer model using the Mean Squared Error (MSE) loss and
SSCL emerges as the most effective approach in time series forecasting.
Notably, the incorporation of the contrastive objective enables the model to
prioritize more pertinent information for forecasting, such as scale and
periodic relationships. These findings contribute to a better understanding of
the benefits of SSCL in time series forecasting and provide valuable insights
for future research in this area. Our code is available at
https://github.com/chiyuzhang94/contrastive_learning_time-series_e2e
Comment: Accepted at IJCAI'22 Workshop-AI4TS: AI for Time Series Analysis
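The central finding, that end-to-end training with the MSE loss plus a contrastive objective works best, can be sketched as a single training step that combines a forecasting loss with an InfoNCE-style term computed on two augmented views of the input window. The encoder and forecaster interfaces, the augment function, and the weight lambda_cl are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        """Simplified contrastive loss: embeddings of two augmented views of the
        same window are positives; other windows in the batch are negatives."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)

    def joint_step(encoder, forecaster, window, target, augment, lambda_cl=0.1):
        """One end-to-end step: MSE on the forecast plus a contrastive term."""
        pred = forecaster(encoder(window))            # (B, horizon, D) forecast
        mse = F.mse_loss(pred, target)

        # Two independently augmented views of the same input window.
        z1, z2 = encoder(augment(window)), encoder(augment(window))
        return mse + lambda_cl * info_nce(z1, z2)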
Robust Reinforcement Learning Objectives for Sequential Recommender Systems
Attention-based sequential recommendation methods have demonstrated promising
results by accurately capturing users' dynamic interests from historical
interactions. In addition to generating superior user representations, recent
studies have begun integrating reinforcement learning (RL) into these models.
Framing sequential recommendation as an RL problem with reward signals makes it
possible to develop recommender systems (RS) that account for a vital aspect:
incorporating direct user feedback in the form of rewards to deliver a more personalized
experience. Nonetheless, employing RL algorithms presents challenges, including
off-policy training, expansive combinatorial action spaces, and the scarcity of
datasets with sufficient reward signals. Contemporary approaches have attempted
to combine RL and sequential modeling, incorporating contrastive-based
objectives and negative sampling strategies for training the RL component. In
this study, we further emphasize the efficacy of contrastive-based objectives
paired with augmentation to address datasets with extended horizons.
Additionally, we recognize the potential instability issues that may arise
during the application of negative sampling. These challenges primarily stem
from the data imbalance prevalent in real-world datasets, which is a common
issue in offline RL contexts. While our established baselines attempt to
mitigate this through various techniques, instability remains an issue.
Therefore, we introduce an enhanced methodology aimed at providing a more
effective solution to these challenges.
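For context on the negative-sampling issue raised above, the sketch below shows a generic sampled next-item objective of the kind such sequence models are commonly trained with: the observed next item is scored against a handful of sampled negatives instead of the full catalogue. With heavily imbalanced interaction data, uniform sampling of negatives is one plausible source of the instability mentioned; the formulation here is an assumed, simplified illustration, not the paper's objective.

    import torch
    import torch.nn.functional as F

    def sampled_next_item_loss(user_repr, item_emb, pos_items, num_negatives=100):
        """Next-item loss with uniformly sampled negatives.

        user_repr: (B, d) sequence representations from the recommender
        item_emb:  nn.Embedding holding one vector per catalogue item
        pos_items: (B,) indices of the observed next items
        """
        B = user_repr.size(0)
        neg_items = torch.randint(0, item_emb.num_embeddings, (B, num_negatives),
                                  device=user_repr.device)

        pos_scores = (user_repr * item_emb(pos_items)).sum(-1, keepdim=True)     # (B, 1)
        neg_scores = torch.einsum("bd,bnd->bn", user_repr, item_emb(neg_items))  # (B, N)

        logits = torch.cat([pos_scores, neg_scores], dim=1)
        # The positive item always sits at index 0 of the candidate list.
        targets = torch.zeros(B, dtype=torch.long, device=user_repr.device)
        return F.cross_entropy(logits, targets)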
Object-Centric Image Generation from Layouts
Despite recent impressive results on single-object and single-domain image
generation, the generation of complex scenes with multiple objects remains
challenging. In this paper, we start with the idea that a model must be able to
understand individual objects and relationships between objects in order to
generate complex scenes well. Our layout-to-image-generation method, which we
call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a
novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of
the spatial relationships between objects in the scene, which lead to our
model's improved layout-fidelity. We also propose changes to the conditioning
mechanism of the generator that enhance its object instance-awareness. Apart
from improving image quality, our contributions mitigate two failure modes in
previous approaches: (1) spurious objects being generated without corresponding
bounding boxes in the layout, and (2) overlapping bounding boxes in the layout
leading to merged objects in images. Extensive quantitative evaluation and
ablation studies demonstrate the impact of our contributions, with our model
outperforming previous state-of-the-art approaches on both the COCO-Stuff and
Visual Genome datasets. Finally, we address an important limitation of
evaluation metrics used in previous works by introducing SceneFID -- an
object-centric adaptation of the popular Fréchet Inception Distance metric,
that is better suited for multi-object images.
Comment: AAAI 202
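A plausible reading of SceneFID, consistent with the description above, is that the usual FID statistics are computed on object crops taken from the layout bounding boxes rather than on whole images. The sketch below illustrates only the crop-and-resize step; the preprocessing details and the downstream FID computation (referenced here as a hypothetical fid_from_images helper) are assumptions, not the authors' evaluation code.

    import torch
    import torch.nn.functional as F

    def object_crops(images, layouts, out_size=299):
        """Crop every layout bounding box from its image and resize it, so that
        Inception statistics can be computed per object rather than per image.

        images:  (B, 3, H, W) tensor with values in [0, 1]
        layouts: list (length B) of lists of boxes (x0, y0, x1, y1) in pixels
        """
        crops = []
        for img, boxes in zip(images, layouts):
            for (x0, y0, x1, y1) in boxes:
                crop = img[:, int(y0):int(y1), int(x0):int(x1)]
                if crop.numel() == 0:   # skip degenerate boxes
                    continue
                crops.append(F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                                           mode="bilinear", align_corners=False))
        return torch.cat(crops, dim=0)

    # Usage sketch: score = fid_from_images(object_crops(real_images, layouts),
    #                                       object_crops(generated_images, layouts))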